Discriminative training of HMM models using maximum margin estimation for speech recognition

ABSTRACT

An improved discriminative training method is provided for hidden Markov models. The method includes: defining a measure of separation margin for the data; identifying a subset of training utterances having utterances misrecognized by the models; defining a training criterion for the models based on maximizing the separation margin; formulating the training criterion as a constrained minimax optimization problem; and solving the constrained minimax optimization problem over the subset of training utterances, thereby discriminatively training the models.

FIELD OF THE INVENTION

The present invention relates generally to discriminative model training and, more particularly, to an improved method for discriminative training of hidden Markov models (HMMs) based on maximum margin estimation.

BACKGROUND OF THE INVENTION

Discriminative training has been extensively studied over the past decade and has proved to be quite effective for improving automatic speech recognition performance. Minimum classification error (MCE) and maximum mutual information (MMI) are two of the more popular discriminative training methods. Despite significant progress in this area, many issues related to discriminative training remain unsolved. One issue reported by many researchers is that discriminative training methods for speech recognition suffer from poor generalization capability. In other words, discriminative training can dramatically reduce the error rate on the training data, but such significant performance gains cannot be maintained on unseen test data.

Therefore, it is desirable to provide a discriminative training method for hidden Markov models which improves the generalization capability of the models.

SUMMARY OF THE INVENTION

An improved discriminative training method is provided for hidden Markov models. The method includes: defining a measure of separation margin for the data; identifying a subset of training utterances having utterances misrecognized by the models; defining a training criterion for the models based on the principle of maximizing the separation margin; formulating the training criterion as a constrained minimax optimization problem; and solving the constrained minimax optimization problem over the subset of training utterances, thereby discriminatively training the models.

Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In automatic speech recognition, given any speech utterance X, a speech recognizer will choose the word Ŵ as output based on the MAP decision rule as follows:
$\begin{aligned} \hat{W} &= \underset{W}{\arg\max}\; P(W \mid X) \\ &= \underset{W}{\arg\max}\; P(W) \cdot P(X \mid W) \\ &= \underset{W}{\arg\max}\; P(W) \cdot P(X \mid \lambda_{W}) \\ &= \underset{W}{\arg\max}\; F(X \mid \lambda_{W}) \end{aligned} \qquad (1)$
where λ_(W) denotes the HMM representing the word W and F(X|λ_(W))=P(W)·P(X|λ_(W)) is called the discriminant function. Depending on the problem of interest, a word W is used herein to mean any linguistic unit, such as a phoneme, a syllable, a word, a phrase or a sentence. For discussion purposes, this description focuses on the hidden Markov models λ_(W) and assumes that P(W) is fixed. While the following description is provided with reference to hidden Markov models, it is readily understood that the broader aspects of the present invention are also applicable to other types of acoustic models.

For a speech utterance X_(i) whose true word identity is W_(i)^(T), the multi-class separation margin for X_(i) is defined as:
$\begin{aligned} d(X_i) &= F(X_i \mid \lambda_{W_i^T}) - \max_{w_j \in \Omega,\; w_j \neq W_i^T} F(X_i \mid \lambda_{w_j}) \qquad (2) \\ &= \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ F(X_i \mid \lambda_{W_i^T}) - F(X_i \mid \lambda_{w_j}) \right] \qquad (3) \end{aligned}$
where Ω denotes the set of all possible words.

Obviously, if d(X_(i))<0, X_(i) will be incorrectly recognized by the current HMM set, denoted as Λ; if d(X_(i))>0, X_(i) will be correctly recognized by the models Λ.
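The margin of equations (2) and (3) can be illustrated with a minimal Python sketch. The function name `separation_margin`, the dictionary of scores, and the numeric values are hypothetical and chosen only for illustration; the scores stand in for the discriminant values F(X_(i)|λ_(w)) of one utterance.

```python
from typing import Dict


def separation_margin(scores: Dict[str, float], true_word: str) -> float:
    """Equation (2): score of the true word minus the best competing score."""
    best_competitor = max(s for w, s in scores.items() if w != true_word)
    return scores[true_word] - best_competitor


# Hypothetical discriminant values F(X | lambda_w) for one utterance.
scores = {"yes": -41.0, "no": -45.5, "maybe": -43.0}

# If the true word is "yes", the margin is positive (correctly recognized).
margin_correct = separation_margin(scores, "yes")
# If the true word is "no", the margin is negative (misrecognized).
margin_wrong = separation_margin(scores, "no")
```

A positive value means the true model outscores every competitor, which matches the recognition condition stated above.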

Given a set of training data D={X₁, X₂, . . . , X_(N)}, we usually know the true word identities for all utterances in D, denoted as L={W₁^(T), W₂^(T), . . . , W_(N)^(T)}. Thus, we can calculate the separation margin (also referred to hereafter as the margin) for every utterance in D based on the definition in equation (2) or (3). To estimate the HMM parameters Λ, one desirable estimation criterion is to minimize the total number of utterances in the whole training set which have negative margins, as in standard MCE estimation. Furthermore, motivated by the large margin principle in machine learning, even when all utterances have positive margins, we may still want to maximize the minimum margin among them to obtain an HMM-based large margin classifier. According to machine learning theory, a large margin classifier usually leads to a much lower generalization error rate on a new test set and shows more robust generalization capability. The following shows how to estimate HMMs for speech recognition based on the above-mentioned principle of maximizing the minimum multi-class separation margin.

First of all, from all utterances in D, we need to identify a subset of utterances, S={X_(i) | X_(i)∈D and 0≤d(X_(i))≤γ}  (4) where γ>0 is a pre-set positive number. By analogy with support vector machines, we call S the support vector set, and each utterance in S is called a support token, which has a relatively small positive margin among all utterances in the training set D. In other words, all utterances in S are relatively close to the classification boundary even though all of them lie in the correct decision regions. To achieve better generalization, it is desirable to adjust the decision boundaries, which are implicitly determined by all models, by optimizing the HMM parameters Λ so that all support tokens are as far from the decision boundaries as possible, which results in a robust classifier with better generalization capability. This idea leads to estimating the HMM models Λ based on the criterion of maximizing the minimum margin of all support tokens, which is named large margin estimation (LME) or maximum margin estimation (MME) of HMMs:
$\tilde{\Lambda} = \underset{\Lambda}{\arg\max}\; \min_{X_i \in S} d(X_i) \qquad (5)$
where the above maximization and minimization are performed subject to the constraints that d(X_(i))>0 for all X_(i)∈S. The HMM models Λ̃ estimated in this way are called large margin or maximum margin HMMs. For simplicity of explanation, we will only use the term large margin estimation hereafter.

Considering equation (3), large margin HMMs can be equivalently estimated as follows:
$\tilde{\Lambda} = \underset{\Lambda}{\arg\max}\; \min_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \left[ F(X_i \mid \lambda_{W_i^T}) - F(X_i \mid \lambda_{w_j}) \right] \qquad (6)$
subject to
$F(X_i \mid \lambda_{W_i^T}) - F(X_i \mid \lambda_{w_j}) > 0 \qquad (7)$
for all X_(i)∈S and w_(j)∈Ω, w_(j)≠W_(i)^(T).

Finally, the above optimization can be converted into a standard minimax optimization problem as:
$\tilde{\Lambda} = \underset{\Lambda}{\arg\min}\; \max_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \left[ F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{W_i^T}) \right] \qquad (8)$
where the minimax optimization is subject to the following constraint:
$F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{W_i^T}) < 0 \qquad (9)$
for all X_(i)∈S and w_(j)∈Ω, w_(j)≠W_(i)^(T).

Since large margin estimation is derived from support vector machines in machine learning, the definition of the training set is analogous to that of the support vector set for support vector machines as seen in equation (4) above. In other words, the support vector set only consists of positive tokens (i.e., training data correctly recognized by the baseline model). Negative or misrecognized tokens are discarded in the large margin estimation approach. As a result, large margin estimation typically uses minimum classification error training to bootstrap the training (i.e., uses the MCE model as a seed model to start the training).

The present invention proposes to further include the negative tokens in the support vector set. The support vector set is redefined as follows: S={X_(i) | X_(i)∈D and d(X_(i))≤γ}  (10) where γ is a positive constant. In other words, a subset of training data is identified which includes data misrecognized by the models. However, the subset of training data may also include data correctly recognized by the models. Accordingly, the minimax optimization problem may be solved using this new support vector set. It is readily understood that different optimization approaches for solving this problem are within the scope of the present invention.
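The difference between the two support set definitions, equations (4) and (10), can be sketched in a few lines of Python. The function names and the margin values are hypothetical; each margin stands for d(X_(i)) of one training utterance.

```python
from typing import List


def support_set_original(margins: List[float], gamma: float) -> List[int]:
    """Equation (4): keep only correctly recognized tokens near the boundary."""
    return [i for i, d in enumerate(margins) if 0.0 <= d <= gamma]


def support_set_extended(margins: List[float], gamma: float) -> List[int]:
    """Equation (10): additionally keep every misrecognized token (d < 0)."""
    return [i for i, d in enumerate(margins) if d <= gamma]


# Hypothetical margins d(X_i) for six training utterances.
margins = [5.0, 0.4, -1.2, 0.9, -0.1, 3.0]
gamma = 1.0

original = support_set_original(margins, gamma)  # indices 1 and 3
extended = support_set_extended(margins, gamma)  # indices 1, 2, 3 and 4
```

In this toy example the extended set adds the two tokens with negative margins, which the original definition discards.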

Assuming there are misrecognized tokens, the minimization in the criterion of equation (5) will choose the most negative token, which is farthest from the decision boundary and lies in the wrong decision region. This is very different from the original large margin estimation training, where the minimization always chooses the token that is nearest to the decision boundary but lies in the correct decision region. According to the criterion, the maximization will push the negative tokens across the decision boundaries so that they will have positive margins. This is similar to minimum classification error training but works in a more direct and effective fashion. In this way, large margin estimation no longer needs to use MCE to bootstrap, thereby completely removing any need for MCE in the training process.

The present invention applies large margin estimation (LME) directly to both misrecognized data and correctly recognized data, as opposed to the previous method, in which only correctly recognized training data can be used in training. It takes full advantage of LME because more training data participate in the training, and therefore it can achieve higher accuracy than the existing LME method. Furthermore, in large vocabulary continuous speech recognition (LVCSR) tasks, only a very small percentage of training data will be correctly recognized by the baseline models. In the previous LME method, the benefit of large margin estimation is greatly limited by the lack of applicable training data, and the method cannot be applied at all when none of the training data is correctly recognized, which is common for LVCSR tasks. The present invention has no such problem and can be directly applied to LVCSR tasks. Another advantage of this invention is that, unlike the existing LME method, it does not need MCE to bootstrap the training, so the overall training time is shorter.

Constraints for the large margin estimation do not guarantee the existence of a minimax point. As an illustration, assume a simple case with only two classes m1 and m2 and a support token X close to the decision boundary. If we pull m1 and m2 together at the same time, we can keep the boundary unchanged but increase the margin defined in equation (3) as much as we want. As the models move toward X, the absolute values of both F(X|m1) and F(X|m2) increase, and so does the margin, although the position of X relative to the boundary does not actually change at all.

More constraints must be introduced in the minimax optimization procedure to make sure that the optimal point does exist. In one exemplary approach, a localized optimization strategy is adopted. Rather than optimizing the parameters of all models at the same time, only one selected model is adjusted in each step, and then the process iterates to update another model until the minimum margin is maximized.

The iterative localized optimization may be summarized as follows:

-   Repeat
    -   1. Identify the support set S based on the current model set Λ^((n)).
    -   2. Choose the support token, say X_(k), from S which currently gives the minimum margin; choose the true model of X_(k), say λ_(k)^((n)), for optimization in this iteration.
    -   3. Maximize the minimum margin by ONLY updating the model λ_(k): λ_(k)^((n)) → λ_(k)^((n+1)).
    -   4. n=n+1.
-   until some convergence conditions are met.
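The loop above can be sketched on a toy problem. The following Python sketch makes several simplifying assumptions that are not part of the described method: each class is a single one-dimensional Gaussian with unit precision, the per-step model update is a plain gradient step that pulls the selected model's mean toward the selected token, and the convergence condition is that the minimum margin stops improving. All names (`score`, `margin`, `localized_training`) and data values are hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[float, str]  # (1-D observation, true class label)


def score(x: float, mean: float) -> float:
    """Toy discriminant: log of a unit-precision Gaussian, constants dropped."""
    return -0.5 * (x - mean) ** 2


def margin(x: float, true_label: str, means: Dict[str, float]) -> float:
    """Separation margin in the form of equation (3)."""
    return min(score(x, means[true_label]) - score(x, m)
               for label, m in means.items() if label != true_label)


def localized_training(data: List[Token], means: Dict[str, float],
                       gamma: float, step: float,
                       max_iters: int = 50) -> Dict[str, float]:
    means = dict(means)
    for _ in range(max_iters):
        margins = [margin(x, w, means) for x, w in data]
        # Step 1: support set as in equation (10), negative tokens included.
        support = [i for i, d in enumerate(margins) if d <= gamma]
        if not support:
            break
        # Step 2: token with the minimum margin and its true model.
        k = min(support, key=lambda i: margins[i])
        x_k, w_k = data[k]
        # Step 3 (assumed update rule): move only that model toward the token.
        candidate = dict(means)
        candidate[w_k] += step * (x_k - means[w_k])
        new_min = min(margin(x, w, candidate) for x, w in data)
        # Assumed convergence test: stop when the minimum margin stops growing.
        if new_min <= margins[k]:
            break
        means = candidate
    return means


data = [(-1.0, "a"), (0.6, "a"), (2.0, "b"), (0.9, "b")]
initial = {"a": 0.0, "b": 1.0}
trained = localized_training(data, initial, gamma=1.0, step=0.5)
min_before = min(margin(x, w, initial) for x, w in data)
min_after = min(margin(x, w, trained) for x, w in data)
```

In this toy run the token at 0.6 starts out misrecognized, and only the mean of class "a" is moved; accepting a step only when the minimum margin grows is one simple way to keep the iteration from oscillating.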

In the above iterative localized optimization method, only one model, say λ_(k), is updated in each iteration based on the minimax optimization given in equation (8), so that we only need to consider those functions which are relevant to the currently selected model. The minimax optimization can be re-formulated as:
$\lambda_k^{(n+1)} = \underset{\lambda_k}{\arg\min}\; \max_{X_i \in S,\; j \neq i,\; i = k \text{ or } j = k} \left[ F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{W_i^T}) \right] \qquad (11)$
subject to the constraints in equation (10). This localized minimax optimization can be solved numerically using optimization software tools. However, given the large number of parameters in HMMs, a general-purpose minimax tool is usually too slow for this optimization problem.

One alternative is to use a GPD-based algorithm to solve the minimax problem in equation (11) in an approximate way. First of all, based on equation (11), we construct a differentiable objective function as follows:
$Q(\lambda_k) = \frac{1}{\eta} \log \left\{ \sum_{X_i \in S,\; j \neq i,\; i = k \text{ or } j = k} \exp\left[ \eta F(X_i \mid \lambda_{W_j}) - \eta F(X_i \mid \lambda_{W_i}) \right] \right\} \qquad (12)$
where η>1 is a constant. As η→∞, Q(λ_(k)) will approach the maximization in equation (11). Then, the GPD algorithm can be used to update the model parameters, λ_(k), in order to minimize the above approximate objective function, Q(λ_(k)).
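The way the log-sum-exp expression of equation (12) approaches a maximum as η grows can be checked numerically. The sketch below is illustrative only: `smooth_max` is a hypothetical helper, the input values stand in for the differences F(X_(i)|λ_(W_j))−F(X_(i)|λ_(W_i)), and subtracting the peak before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math
from typing import Sequence


def smooth_max(values: Sequence[float], eta: float) -> float:
    """(1/eta) * log(sum(exp(eta * v))), computed stably."""
    peak = max(values)
    total = sum(math.exp(eta * (v - peak)) for v in values)
    return peak + math.log(total) / eta


# Hypothetical score differences for three (token, competitor) pairs.
differences = [-0.3, 0.1, -1.2]

loose = smooth_max(differences, eta=2.0)    # noticeably above the true maximum
tight = smooth_max(differences, eta=100.0)  # very close to the true maximum
```

The approximation always lies between the true maximum and the true maximum plus log(n)/η for n terms, so a larger η gives a tighter but less smooth objective.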

Assume each speech unit, e.g., a word W, is modeled by an N-state CDHMM with parameter vector λ=(π, A, θ), where π is the initial state distribution, A={a_(ij) | 1≤i, j≤N} is the transition matrix, and θ is the parameter vector composed of the mixture parameters θ_(i)={ω_(ik), m_(ik), r_(ik)}, k=1, 2, . . . , K, for each state i, where K denotes the number of Gaussian mixtures in each state. The state observation p.d.f. is assumed to be a mixture of multivariate Gaussian distributions. In many cases, we prefer to use multivariate Gaussian distributions with diagonal precision matrices. Given any speech utterance X_(i)={x_(i1), x_(i2), . . . , x_(iR)}, F(X_(i)|λ_(w_j)) can be calculated as:
$F(X_i \mid \lambda_{w_j}) = \log\left( P(X_i \mid \lambda_{w_j}) P(w_j) \right) \approx \log P(w_j) + \log \pi_{s_1} + \sum_{t=2}^{T} \log a_{s_{t-1} s_t} + \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} \left[ \log r_{s_t l_t d} - r_{s_t l_t d} \left( x_{itd} - m_{s_t l_t d} \right)^2 \right] \qquad (13)$
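A minimal Python sketch of the path-based score in equation (13) follows. It assumes, for brevity, a single Gaussian per state (so the mixture index drops out), a state sequence that is already given, and the same omission of constant and mixture-weight terms as in the equation. The function name `path_score` and all argument names are hypothetical.

```python
import math
from typing import List, Sequence


def path_score(frames: Sequence[Sequence[float]], states: Sequence[int],
               log_prior: float, log_pi: Sequence[float],
               log_a: Sequence[Sequence[float]],
               means: Sequence[Sequence[float]],
               precisions: Sequence[Sequence[float]]) -> float:
    """Score of one utterance along a fixed state sequence, as in equation (13).

    frames[t][d] is feature d at time t; states[t] is the state at time t;
    means[s][d] and precisions[s][d] describe a diagonal Gaussian per state.
    """
    total = log_prior + log_pi[states[0]]
    # Transition terms, from the second frame onward.
    for t in range(1, len(states)):
        total += log_a[states[t - 1]][states[t]]
    # Diagonal-precision Gaussian terms.
    for x, s in zip(frames, states):
        for d, value in enumerate(x):
            r = precisions[s][d]
            total += 0.5 * (math.log(r) - r * (value - means[s][d]) ** 2)
    return total


# Tiny hypothetical example: one state, one feature dimension, two frames.
example = path_score(frames=[[1.0], [2.0]], states=[0, 0], log_prior=0.0,
                     log_pi=[0.0], log_a=[[0.0]],
                     means=[[1.0]], precisions=[[1.0]])
```

With unit precision the only non-zero contribution in this example comes from the second frame, which lies one unit away from the mean.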

Here we only consider a simple case, where we only re-estimate the mean vectors of the CDHMMs based on the large margin principle while keeping all other CDHMM parameters constant during the large margin estimation. For any utterance X_(i) in the support token set S, we can re-write F(X_(i)|λ_(i)) and F(X_(i)|λ_(j)) according to equation (13) as follows:
$F(X_i \mid \lambda_i) \approx C' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r_{s'_t l'_t d} \left( x_{itd} - m_{s'_t l'_t d} \right)^2 \qquad (14)$
$F(X_i \mid \lambda_j) \approx C'' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r_{s''_t l''_t d} \left( x_{itd} - m_{s''_t l''_t d} \right)^2 \qquad (15)$
where C′ and C″ are two constants independent of the mean vectors; the log r terms of equation (13) do not depend on the means and are absorbed into these constants. In this case, the discriminant functions F(X_(i)|λ_(i)) and F(X_(i)|λ_(j)) can be represented as a summation of quadratic functions of the mean values of the CDHMMs. Then we can represent the decision margin F(X_(i)|λ_(i))−F(X_(i)|λ_(j)) as:
$F(X_i \mid \lambda_i) - F(X_i \mid \lambda_j) \approx C - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} \left[ r_{s'_t l'_t d} \left( x_{itd} - m_{s'_t l'_t d} \right)^2 - r_{s''_t l''_t d} \left( x_{itd} - m_{s''_t l''_t d} \right)^2 \right] \qquad (16)$
where C=C′−C″.

From equations (12) and (16), it is straightforward to calculate the gradient of the objective function, Q(λ_(k)), with respect to each mean vector in the model λ_(k).

Finally, we can use the GPD algorithm to adjust λ_(k) to minimize the objective function as follows:
$\mu_{sql}^{(n+1)} = \mu_{sql}^{(n)} - \epsilon \left. \frac{\partial Q(\lambda_k)}{\partial \mu_{sql}} \right|_{\lambda_k = \lambda_k^{(n)}} \qquad (17)$
where μ_(sql)^((n+1)) denotes the l-th dimension of the Gaussian mean vector for the q-th mixture component of state s of HMM model λ_(k) at the (n+1)-th iteration, and ε is the step size.
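One gradient step of the kind described by equations (12) and (17) can be sketched on a toy problem. The sketch assumes one-dimensional, unit-precision, single-Gaussian models, so that F(x|m)=−(x−m)²/2 up to a constant and ∂F/∂m=x−m; these simplifications, the function names, and the data are all hypothetical and serve only to show the mechanics of the update.

```python
import math
from typing import Dict, List, Tuple

Token = Tuple[float, str]  # (1-D observation, true class label)


def score(x: float, mean: float) -> float:
    return -0.5 * (x - mean) ** 2


def pair_terms(tokens: List[Token], means: Dict[str, float], k: str):
    """(x, derivative sign, F_j - F_i) for every pair that involves model k."""
    out = []
    for x, true in tokens:
        for rival in means:
            if rival == true or k not in (true, rival):
                continue
            gap = score(x, means[rival]) - score(x, means[true])
            sign = 1.0 if rival == k else -1.0  # d(gap)/dF_k
            out.append((x, sign, gap))
    return out


def objective(tokens, means, k, eta):
    """Q(lambda_k) in the form of equation (12), computed stably."""
    gaps = [eta * g for _, _, g in pair_terms(tokens, means, k)]
    peak = max(gaps)
    return (peak + math.log(sum(math.exp(g - peak) for g in gaps))) / eta


def gradient(tokens, means, k, eta):
    """dQ/dm_k: softmax-weighted sum of the per-pair derivatives."""
    terms = pair_terms(tokens, means, k)
    peak = max(eta * g for _, _, g in terms)
    weights = [math.exp(eta * g - peak) for _, _, g in terms]
    total = sum(weights)
    return sum(w / total * sign * (x - means[k])
               for (x, sign, _), w in zip(terms, weights))


def gpd_step(tokens, means, k, eta, epsilon):
    """One update in the form of equation (17), applied to the mean of model k."""
    updated = dict(means)
    updated[k] = means[k] - epsilon * gradient(tokens, means, k, eta)
    return updated


tokens = [(-1.0, "a"), (0.6, "a"), (2.0, "b"), (0.9, "b")]
means = {"a": 0.0, "b": 1.0}
before = objective(tokens, means, "a", eta=5.0)
stepped = gpd_step(tokens, means, "a", eta=5.0, epsilon=0.1)
after = objective(tokens, stepped, "a", eta=5.0)
```

In this example the objective is dominated by the misrecognized token at 0.6, so the step moves the mean of model "a" toward that token and lowers the objective.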

In an alternative approach, the definition of margin may be changed to a relative separation margin as defined below:
$\tilde{d}(X_i) = \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{F(X_i \mid \lambda_{W_i^T}) - F(X_i \mid \lambda_{w_j})}{F(X_i \mid \lambda_{W_i^T})} \right] \qquad (18)$

If the discriminant functions F(·) are defined as in equation (1), then for all support tokens in the set S defined in equation (10), the relative margin d̃(X_(i)) will be less than 1. Since the relative margin has an upper bound by definition, the maximum value of the relative margin always exists. However, in many cases, F(X_(i)|λ) is defined as the log-likelihood of X_(i) given the model set Λ, so F(X_(i)|λ_(W_i^T))<0. To make the relative margin meaningful (i.e., positive values for correctly recognized data and negative values for misrecognized data), we slightly modify its definition as:
$\tilde{d}(X_i) = \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{W_i^T})}{F(X_i \mid \lambda_{w_j})} \right] \qquad (19)$
Thus, for correctly recognized data, F(X_(i)|λ_(w_j))<F(X_(i)|λ_(W_i^T)) and therefore d̃(X_(i))>0. Similarly, we define the support vector set S as in equation (10). Therefore, our new training criterion is defined as
$\tilde{\Lambda} = \underset{\Lambda}{\arg\min}\; \max_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{F(X_i \mid \lambda_{W_i^T})}{F(X_i \mid \lambda_{w_j})} - 1 \right] \qquad (20)$
where Ω denotes the set of all possible words. This technique is referred to as large relative margin estimation (LRME) or maximum relative margin estimation (MRME) of HMMs. In this case, different optimization approaches can be used for updating all model parameters at the same time.
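The modified relative margin of equation (19) can be illustrated with a short Python sketch. It assumes that all discriminant values are negative log-likelihoods, as discussed above; the function name `relative_margin` and the numeric scores are hypothetical.

```python
from typing import Dict


def relative_margin(log_scores: Dict[str, float], true_word: str) -> float:
    """Equation (19), assuming every score is a negative log-likelihood."""
    if any(f >= 0.0 for f in log_scores.values()):
        raise ValueError("this sketch assumes strictly negative scores")
    f_true = log_scores[true_word]
    return min((f_j - f_true) / f_j
               for w, f_j in log_scores.items() if w != true_word)


# Hypothetical log-likelihood scores for one utterance.
log_scores = {"yes": -40.0, "no": -50.0, "maybe": -80.0}

relative_correct = relative_margin(log_scores, "yes")  # true word scores best
relative_wrong = relative_margin(log_scores, "no")     # true word is outscored
```

Because each ratio equals one minus F(X_(i)|λ_(W_i^T))/F(X_(i)|λ_(w_j)), the value stays below 1 for negative scores, which is the bounded behavior the relative margin is meant to provide.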

For example, an iterative approach is proposed based on the generalized probabilistic descent (GPD) algorithm. First, a differentiable objective function is constructed. To do so, a summation of exponential functions is used to approximate the maximization in equation (20) as follows:
$\max_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{F(X_i \mid \lambda_{W_i^T})}{F(X_i \mid \lambda_{w_j})} - 1 \right] \approx \log \left\{ \sum_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \exp\left[ \eta\, d(X_i, \lambda_{w_j}, \lambda_{W_i^T}) \right] \right\}^{1/\eta} \qquad (21)$
with
$d(X_i, \lambda_{w_j}, \lambda_{W_i^T}) = d_{ij} = \frac{F(X_i \mid \lambda_{W_i^T})}{F(X_i \mid \lambda_{w_j})} - 1$
where η>1. As η→∞, the continuous function on the right-hand side of equation (21) will approach the maximization on the left-hand side.

Therefore, we define the objective function as:
$\begin{aligned} Q(\Lambda) &= \frac{1}{\eta} \log \left\{ \sum_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \exp(\eta d_{ij}) \right\} \qquad (22) \\ &= \frac{1}{\eta} \log \left\{ \sum_{X_i \in S}\; \sum_{w_j \in \Omega,\; w_j \neq W_i^T} \exp(\eta d_{ij}) \right\} \qquad (23) \\ &= \frac{1}{\eta} \log Q_1 \qquad (24) \end{aligned}$
where Q₁ denotes the double summation inside the logarithm in equation (23).

Now, we can use the GPD algorithm to adjust Λ to minimize the objective function Q(Λ). To maintain HMM model constraints during the optimization process, we need to define the same transformations for model parameters as known in minimum classification error training methods. For Gaussian means, the transformation is
$\tilde{\mu}_{skl}^{m} = \frac{\mu_{skl}^{m}}{\sigma_{skl}^{m}}$
where μ̃_(skl)^(m) is the transformed Gaussian mean, and μ_(skl)^(m) and σ_(skl)^(m) are the original Gaussian mean and standard deviation, respectively. Then it can be shown that the iterative adjustment of Gaussian means follows
$\tilde{\mu}_{skl}^{m}(n+1) = \tilde{\mu}_{skl}^{m}(n) - \epsilon \left. \frac{\partial Q(\Lambda)}{\partial \tilde{\mu}_{skl}^{m}} \right|_{\Lambda = \Lambda_n} \qquad (25)$
$\mu_{skl}^{m}(n+1) = \sigma_{skl}^{m}\, \tilde{\mu}_{skl}^{m}(n+1) \qquad (26)$
where μ_(skl)^(m)(n+1) is the l-th dimension of the Gaussian mean vector for the k-th mixture component of state s of HMM model m at the (n+1)-th iteration, and ε is the step size. The gradient in equation (25) follows from the chain rule, using
$\frac{\partial Q(\Lambda)}{\partial Q_1} = \frac{1}{\eta} \frac{1}{Q_1} \qquad (27)$
$\begin{aligned} \frac{\partial Q_1}{\partial \tilde{\mu}_{skl}^{m}} &= \sum_{X_i \in S} \left\{ \sum_{w_j \in \Omega,\; w_j \neq W_i^T} \eta \exp(\eta d_{ij}) \frac{\partial d_{ij}}{\partial \tilde{\mu}_{skl}^{m}} \right\} \\ &= \sum_{X_i \in S} \left\{ \delta(W_i^T - m)\, \eta\, \frac{\partial F(X_i \mid \lambda_m)}{\partial \tilde{\mu}_{skl}^{m}} \sum_{w_j \in \Omega,\; w_j \neq m} \frac{\exp(\eta d_{ij})}{F(X_i \mid \lambda_{w_j})} \right. \\ &\qquad \left. - \left( 1 - \delta(W_i^T - m) \right) \frac{F(X_i \mid \lambda_{W_i^T})}{F^2(X_i \mid \lambda_m)}\, \eta \exp(\eta d_{im}) \frac{\partial F(X_i \mid \lambda_m)}{\partial \tilde{\mu}_{skl}^{m}} \right\} \end{aligned} \qquad (28)$
where δ(W_(i)^(T)−m)=1 when W_(i)^(T)=m, that is, when the true model for utterance X_(i) is the m-th model in the model set Λ, and δ(W_(i)^(T)−m)=0 when W_(i)^(T)≠m; d_(im) denotes d_(ij) evaluated for w_(j)=m. As
$\begin{aligned} F(X_i \mid \lambda_m) &= \log L(X_i \mid \lambda_m) \\ &\approx \log L(X_i, q; \lambda_m) \\ &= \sum_{t=1}^{T} \left[ \log a_{q_{t-1} q_t}^{m} + \log b_{q_t}^{m}(x_t) \right] + \log \pi_{q_0}^{m} \end{aligned} \qquad (29)$
$b_j^{m}(x_t) = \sum_{k=1}^{K} c_{jk}^{m}\, N\left[ x_t; \mu_{jk}^{m}, R_{jk}^{m} \right] \qquad (30)$
so
$\frac{\partial F(X_i \mid \lambda_m)}{\partial \tilde{\mu}_{skl}^{m}} = \sum_{t=1}^{T} \delta(q_t - s) \frac{\partial \log b_s^{m}(x_t)}{\partial \tilde{\mu}_{skl}^{m}} \qquad (31)$
where
$\frac{\partial \log b_s^{m}(x_t)}{\partial \tilde{\mu}_{skl}^{m}} = c_{sk}^{m} (2\pi)^{-\frac{D}{2}} \left| R_{sk}^{m} \right|^{-\frac{1}{2}} \left( b_s^{m}(x_t) \right)^{-1} \left( \frac{x_{tl} - \mu_{skl}^{m}}{\sigma_{skl}^{m}} \right) \exp\left\{ -\frac{1}{2} \sum_{l'=1}^{D} \left( \frac{x_{tl'} - \mu_{skl'}^{m}}{\sigma_{skl'}^{m}} \right)^2 \right\} \qquad (32)$
D is the dimension of the feature vectors. R_(sk)^(m) is the covariance matrix for state s and Gaussian mixture component k of HMM model m; here we assume it is diagonal. q is the best state sequence obtained by aligning X_(i) using HMM model λ_(m).

Combining equations from (27) to (32), we can easily obtain ∂Q(Λ)/∂{tilde over (μ)}_(skl) ^(m) for equation (25). Similar derivations for the variances, mixture weights and transition probabilities can be easily accomplished.

Note that there may be alternative definitions to the one given in equation (19). One alternative definition is
$\tilde{d}(X_i) = \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{\exp\left( F(X_i \mid \lambda_{W_i^T}) \right) - \exp\left( F(X_i \mid \lambda_{w_j}) \right)}{\exp\left( F(X_i \mid \lambda_{W_i^T}) \right)} \right] \qquad (33)$
Based on this alternative definition, it is readily understood that the estimation formula for HMM model parameters can be derived.

The description of the invention is merely exemplary in nature and, thus, variations that do not depart from the gist of the invention are intended to be within the scope of the invention. Such variations are not to be regarded as a departure from the spirit and scope of the invention. 

1. A discriminative training method for hidden Markov models, comprising: defining a measure of separation margin for the data; identifying, based on the definition of the separation margin, a subset of training data having data misrecognized by the models; defining a training criterion for the models based on maximum margin estimation; formulating the training criterion as a minimax optimization problem; and solving the constrained minimax optimization problem over the subset of training data, thereby discriminatively training the models.
 2. The discriminative training method of claim 1 wherein each datum of the subset of training data has a separation margin from classification boundaries of the models which is equal to or less than a threshold value.
 3. The discriminative training method of claim 1 wherein the subset of training data, S, is S={X_(i) | X_(i)∈D and d(X_(i))≤γ} where X_(i) is a datum in a set of training data D, d(X_(i)) is a separation margin for the datum X_(i) and γ is a constant threshold.
 4. The discriminative training method of claim 1 wherein the training criterion is further defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\max}\; \min_{X_i \in S} d(X_i)$ where Λ is an estimated set of models, X_(i) is a training datum in the subset of training data, S is the subset of training data and d(X_(i)) is a separation margin for the training datum.
 5. The discriminative training method of claim 1 wherein a maximum margin estimation is further defined as a large margin estimation or a large relative margin estimation.
 6. The discriminative training method of claim 4 wherein defining the separation margin is as follows $d(X_i) = \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ F(X_i \mid \lambda_{W_i^T}) - F(X_i \mid \lambda_{w_j}) \right]$ such that the training criterion is defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\max}\; \min_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \left[ F(X_i \mid \lambda_{W_i^T}) - F(X_i \mid \lambda_{w_j}) \right]$ where λ_(W) denotes a model representing a word W, F(X|λ_(W))=p(W) p(X|λ_(W)) and Ω denotes the set of all possible words.
 7. The discriminative training method of claim 6 wherein solving the constrained minimax optimization problem uses an iterative localized optimization algorithm.
 8. The discriminative training method of claim 4 wherein defining the separation margin is as follows $\tilde{d}(X_i) = \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{W_i^T})}{F(X_i \mid \lambda_{w_j})} \right]$ such that the training criterion is defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\min}\; \max_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{F(X_i \mid \lambda_{W_i^T})}{F(X_i \mid \lambda_{w_j})} - 1 \right]$ where λ_(W) denotes a model representing a word W, F(X|λ_(W))=p(W) p(X|λ_(W)) and Ω denotes the set of all possible words.
 9. The discriminative training method of claim 4 wherein defining the separation margin is as follows $\tilde{d}(X_i) = \min_{w_j \in \Omega,\; w_j \neq W_i^T} \left[ \frac{\exp\left( F(X_i \mid \lambda_{W_i^T}) \right) - \exp\left( F(X_i \mid \lambda_{w_j}) \right)}{\exp\left( F(X_i \mid \lambda_{W_i^T}) \right)} \right]$ such that the training criterion is defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\min} \left[ \max_{X_i \in S,\; w_j \in \Omega,\; w_j \neq W_i^T} \exp\left( F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{W_i^T}) \right) - 1 \right]$ where λ_(W) denotes a model representing a word W, F(X|λ_(W))=p(W) p(X|λ_(W)) and Ω denotes the set of all possible words.
 10. The discriminative training method of claim 8 wherein solving the constrained minimax optimization problem uses a generalized probabilistic descent algorithm.
 11. The discriminative training method of claim 9 wherein solving the constrained minimax optimization problem uses a generalized probabilistic descent algorithm.
 12. A discriminative training method for hidden Markov models, comprising: defining a measure of separation margin for the data; defining a training criterion for the models based on maximum margin estimation; formulating the training criterion as a constrained minimax optimization problem; and solving the constrained minimax optimization problem over a subset of training utterances, where the subset of training utterances, S, is $S = \{ X_i \mid X_i \in D \text{ and } d(X_i) \leq \gamma \}$ where X_(i) is a speech utterance in a set of training data D, d(X_(i)) is a separation margin for the speech utterance and γ is a predefined positive number.
 13. The discriminative training method of claim 12 wherein the training criterion is further defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\max}\; \min\limits_{X_i \in S} d(X_i)$ where Λ is an estimated set of acoustic models.
 14. The discriminative training method of claim 12 wherein a maximum margin estimation is further defined as a large margin estimation or a large relative margin estimation.
 15. The discriminative training method of claim 13 further comprises defining the separation margin as follows $d(X_i) = \min\limits_{w_j \in \Omega,\; w_j \neq w_i^T} \left[ F(X_i \mid \lambda_{w_i^T}) - F(X_i \mid \lambda_{w_j}) \right]$ such that the training criterion is defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\max}\; \min\limits_{X_i \in S,\; w_j \in \Omega,\; w_j \neq w_i^T} \left[ F(X_i \mid \lambda_{w_i^T}) - F(X_i \mid \lambda_{w_j}) \right]$ where λ_(W) denotes a model representing a word W, F(X|λ_(W))=p(W) p(X|λ_(W)) and Ω denotes the set of all possible words.
 16. The discriminative training method of claim 15 wherein solving the constrained minimax optimization problem uses an iterative localized optimization algorithm.
 17. The discriminative training method of claim 13 further comprises defining the separation margin as follows $\tilde{d}(X_i) = \min\limits_{w_j \in \Omega,\; w_j \neq w_i^T} \left[ \frac{F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{w_i^T})}{F(X_i \mid \lambda_{w_j})} \right]$ such that the training criterion is defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\min}\; \max\limits_{X_i \in S,\; w_j \in \Omega,\; w_j \neq w_i^T} \left[ \frac{F(X_i \mid \lambda_{w_i^T})}{F(X_i \mid \lambda_{w_j})} - 1 \right]$ where λ_(W) denotes a model representing a word W, F(X|λ_(W))=p(W) p(X|λ_(W)) and Ω denotes the set of all possible words.
 18. The discriminative training method of claim 13 further comprises defining the separation margin as follows $\tilde{d}(X_i) = \min\limits_{w_j \in \Omega,\; w_j \neq w_i^T} \left[ \frac{\exp\left( F(X_i \mid \lambda_{w_i^T}) \right) - \exp\left( F(X_i \mid \lambda_{w_j}) \right)}{\exp\left( F(X_i \mid \lambda_{w_i^T}) \right)} \right]$ such that the training criterion is defined as $\tilde{\Lambda} = \underset{\Lambda}{\arg\min} \left[ \max\limits_{X_i \in S,\; w_j \in \Omega,\; w_j \neq w_i^T} \exp\left( F(X_i \mid \lambda_{w_j}) - F(X_i \mid \lambda_{w_i^T}) \right) - 1 \right]$ where λ_(W) denotes a model representing a word W, F(X|λ_(W))=p(W) p(X|λ_(W)) and Ω denotes the set of all possible words.
 19. The discriminative training method of claim 17 wherein solving the constrained minimax optimization problem uses a generalized probabilistic descent algorithm.
 20. The discriminative training method of claim 18 wherein solving the constrained minimax optimization problem uses a generalized probabilistic descent algorithm.
 21. A discriminative training method for acoustic models, comprising: defining a measure of separation margin for the data; identifying a subset of training utterances having utterances recognized by the acoustic models and utterances misrecognized by the acoustic models; defining a training criterion for the acoustic models based on maximum margin estimation; formulating the training criterion as a minimax optimization problem; and solving the constrained minimax optimization problem over the subset of training utterances, thereby discriminatively training the acoustic models. 
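For illustration only, and not as part of the claims, the quantities recited above can be sketched in a few lines of Python. The sketch assumes each utterance is given as a table of scores standing in for F(X_(i)|λ_(w)) together with its true word. The function names, the toy vocabulary and every number are hypothetical. The sketch only evaluates the separation margin, the subset S and two of the objectives; it does not perform the minimax optimization itself.

```python
import math

def separation_margin(scores, true_word):
    """d(X_i): smallest gap between the true word's score and any competitor's score."""
    f_true = scores[true_word]
    return min(f_true - f for w, f in scores.items() if w != true_word)

def support_set(data, gamma):
    """Indices of utterances with d(X_i) <= gamma (the subset S as recited in claim 12)."""
    return [i for i, (scores, true_word) in enumerate(data)
            if separation_margin(scores, true_word) <= gamma]

def min_margin(data, subset):
    """Inner minimum over S of d(X_i); a trainer would adjust the models to raise it."""
    return min(separation_margin(*data[i]) for i in subset)

def exp_relative_objective(data, subset):
    """Largest exp(F_j - F_true) over S and all competitors, minus 1 (to be minimized)."""
    worst = max(math.exp(f - scores[true_word])
                for scores, true_word in (data[i] for i in subset)
                for w, f in scores.items() if w != true_word)
    return worst - 1.0

# Made-up scores for a three-word vocabulary (assumption, not from the source).
data = [
    ({"yes": -10.0, "no": -14.0, "maybe": -12.0}, "yes"),  # margin 2.0
    ({"yes": -11.0, "no": -10.5, "maybe": -15.0}, "no"),   # margin 0.5
    ({"yes": -20.0, "no": -9.0, "maybe": -30.0}, "no"),    # margin 11.0, outside S
    ({"yes": -8.0, "no": -9.0, "maybe": -12.0}, "no"),     # margin -1.0, misrecognized
]
gamma = 3.0
S = support_set(data, gamma)
```

With these toy values, the third utterance lies far from the decision boundary and is excluded from S, while the misrecognized fourth utterance has a negative margin and determines both objectives.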